Manipulation
Language-Conditioned Imitation Learning for Robot Manipulation Tasks
Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Stefan Lee
Imitation learning is a popular approach for teaching motor skills to robots. However, most approaches focus on extracting policy parameters from execution traces alone (i.e., motion trajectories and perceptual data). No adequate communication channel exists between the human expert and the robot to describe critical aspects of the task, such as the properties of the target object or the intended shape of the motion. Motivated by insights into the human teaching process, we introduce a method for incorporating unstructured natural language into imitation learning. At training time, the expert can provide demonstrations along with verbal descriptions in order to describe the underlying intent (e.g., "go to the large green bowl").
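As a hedged illustration of what language conditioning can look like in practice, the sketch below (not the paper's architecture; all module sizes and names are assumptions) fuses an encoded instruction with a visual feature vector to predict a short joint-space trajectory.

```python
# Minimal sketch (not the paper's model): condition trajectory prediction on
# an instruction embedding and an image feature vector.
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, img_dim=128, horizon=20, dof=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_enc = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(embed_dim + img_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * dof),
        )
        self.horizon, self.dof = horizon, dof

    def forward(self, tokens, img_feat):
        # tokens: (B, T) instruction token ids; img_feat: (B, img_dim) visual features
        _, h = self.lang_enc(self.embed(tokens))      # final hidden state summarizes the instruction
        fused = torch.cat([h[-1], img_feat], dim=-1)  # fuse language and perception
        return self.head(fused).view(-1, self.horizon, self.dof)  # joint-space waypoints

policy = LanguageConditionedPolicy()
tokens = torch.randint(0, 1000, (2, 8))   # e.g., a tokenized "go to the large green bowl"
img_feat = torch.randn(2, 128)
print(policy(tokens, img_feat).shape)     # torch.Size([2, 20, 7])
```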
ActionSense: A Multimodal Dataset and Recording Framework for Human Activities Using Wearable Sensors in a Kitchen Environment
Joseph DelPreto, Chao Liu, Yiyue Luo, Michael Foshey, Yunzhu Li, Antonio Torralba, Wojciech Matusik, Daniela Rus
This paper introduces ActionSense, a multimodal dataset and recording framework with an emphasis on wearable sensing in a kitchen environment. It provides rich, synchronized data streams along with ground truth data to facilitate learning pipelines that could extract insights about how humans interact with the physical world during activities of daily living, and help lead to more capable and collaborative robot assistants. The wearable sensing suite captures motion, force, and attention information; it includes eye tracking with a first-person camera, forearm muscle activity sensors, a body-tracking system using 17 inertial sensors, finger-tracking gloves, and custom tactile sensors on the hands that use a matrix of conductive threads. This is coupled with activity labels and with externally captured data from multiple RGB cameras, a depth camera, and microphones. The specific tasks recorded in ActionSense are designed to highlight lower-level physical skills and higher-level scene reasoning or action planning. They include simple object manipulations (e.g., stacking plates), dexterous actions (e.g., peeling or cutting vegetables), and complex action sequences (e.g., setting a table or loading a dishwasher).
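Since ActionSense emphasizes synchronized multi-rate streams, the sketch below shows one common way such data can be aligned onto a shared clock; the stream names, rates, and shapes are hypothetical placeholders, not the dataset's actual format.

```python
# Minimal sketch (hypothetical stream names/rates): resample asynchronous
# wearable-sensor streams onto a common clock so they can be learned from jointly.
import numpy as np

def resample(timestamps, values, target_times):
    """Linearly interpolate each channel of a stream onto target_times."""
    values = np.atleast_2d(values.T).T  # ensure shape (T, C)
    return np.stack([np.interp(target_times, timestamps, values[:, c])
                     for c in range(values.shape[1])], axis=1)

# Placeholder streams recorded at different rates (times in seconds).
emg_t,  emg  = np.linspace(0, 10, 2000), np.random.randn(2000, 8)      # forearm EMG, ~200 Hz
pose_t, pose = np.linspace(0, 10, 600),  np.random.randn(600, 17 * 3)  # body tracking, ~60 Hz

common_t = np.arange(0, 10, 1 / 50.0)     # resample everything to 50 Hz
aligned = np.concatenate([resample(emg_t, emg, common_t),
                          resample(pose_t, pose, common_t)], axis=1)
print(aligned.shape)                      # (500, 59)
```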
VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation
Recent advancements utilizing large-scale video data for learning video generation models demonstrate significant potential in understanding complex physical dynamics. This suggests the feasibility of leveraging diverse robot trajectory data to develop a unified, dynamics-aware model to enhance robot manipulation. However, given the relatively small amount of available robot data, directly fitting data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism inspired by dual-process theory from neuroscience to enhance stability and improve data utilization efficiency. Specifically, in the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) to predict future visual trajectories in a video denoising diffusion manner, enabling the model to develop a long-horizon awareness of the environment's dynamics. In the second stage, a flexible yet effective layer-wise self-attention adapter is introduced to transform VidMan into an efficient inverse dynamics model that predicts actions modulated by the implicit dynamics knowledge via parameter sharing. Our VidMan framework outperforms the state-of-the-art baseline GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset. These results provide compelling evidence that world models can significantly enhance the precision of robot action prediction. Code and models will be made public.
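A heavily simplified sketch of the two-stage idea follows; it is not VidMan's diffusion objective or adapter design, only an illustration of pre-training a shared backbone on future-feature prediction and then reusing it, frozen, under a small action head.

```python
# Minimal sketch of the two-stage idea (greatly simplified; not the paper's
# actual diffusion training or adapter): a shared backbone is first trained to
# predict future visual features, then a small head reuses the frozen backbone
# to predict actions (an inverse-dynamics-style mapping).
import torch
import torch.nn as nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2)

future_head = nn.Linear(256, 256)   # Stage 1: regress next-step visual features
action_head = nn.Sequential(        # Stage 2: lightweight head for actions
    nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 7))

obs = torch.randn(4, 8, 256)        # (batch, context frames, feature dim)

# Stage 1 (dynamics pre-training): predict features of the next frame.
pred_future = future_head(backbone(obs)[:, -1])

# Stage 2 (action prediction): freeze the backbone, train only the action head.
for p in backbone.parameters():
    p.requires_grad = False
pred_action = action_head(backbone(obs)[:, -1])
print(pred_future.shape, pred_action.shape)   # torch.Size([4, 256]) torch.Size([4, 7])
```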
Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation
Gengshan Yang
Existing 3D generation methods often fail to produce realistic geometric details, resulting in overly smooth surfaces or geometric details inaccurately baked into albedo maps. To address this, we introduce a new method that incorporates touch as an additional modality to improve the geometric details of generated 3D assets. We design a lightweight 3D texture field to synthesize visual and tactile textures, guided by 2D diffusion model priors on both visual and tactile domains. We condition the visual texture generation on high-resolution tactile normals and guide the patch-based tactile texture refinement with a customized TextureDreambooth. We further present a multi-part generation pipeline that enables us to synthesize different textures across various regions. To our knowledge, we are the first to leverage high-resolution tactile sensing to enhance geometric details for 3D generation tasks. We evaluate our method in both text-to-3D and image-to-3D settings. Our experiments demonstrate that our method provides customized and realistic fine geometric textures while maintaining accurate alignment between the two modalities of vision and touch.
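The sketch below illustrates, under assumed dimensions and a made-up architecture, what a lightweight texture field predicting both albedo and tactile-scale normals could look like; it is not the paper's implementation.

```python
# Minimal sketch (hypothetical architecture): a small neural texture field that
# maps 3D surface points to an albedo color and a fine-scale surface normal,
# which diffusion-guided losses in the visual and tactile domains could supervise.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.albedo = nn.Linear(hidden, 3)   # RGB color
        self.normal = nn.Linear(hidden, 3)   # fine geometric detail as a normal

    def forward(self, xyz):
        h = self.trunk(xyz)
        albedo = torch.sigmoid(self.albedo(h))        # colors in [0, 1]
        normal = F.normalize(self.normal(h), dim=-1)  # unit-length normals
        return albedo, normal

field = TextureField()
pts = torch.rand(1024, 3)             # surface samples of a 3D asset
albedo, normal = field(pts)
print(albedo.shape, normal.shape)     # torch.Size([1024, 3]) torch.Size([1024, 3])
```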
Omnigrasp: Grasping Diverse Objects with Simulated Humanoids
Zhengyi Luo, Sammy Christen, Alexander Winkler
We present a method for controlling a simulated humanoid to grasp an object and move it to follow a given object trajectory. Due to the challenges in controlling a humanoid with dexterous hands, prior methods often use a disembodied hand and only consider vertical lifts or short trajectories. This limited scope hampers their applicability to the object manipulation required for animation and simulation. To close this gap, we learn a controller that can pick up a large number (>1200) of objects and carry them to follow randomly generated trajectories. Our key insight is to leverage a humanoid motion representation that provides human-like motor skills and significantly speeds up training.
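The sketch below conveys only the general pattern of acting in a pre-trained motion latent space; all networks and dimensions are placeholders, not Omnigrasp's controller.

```python
# Minimal sketch of the general idea (not the paper's controller): a task policy
# acts in the latent space of a pre-trained motion representation, whose decoder
# turns latents into human-like joint targets; this shrinks the action space and
# typically speeds up reinforcement learning.
import torch
import torch.nn as nn

motion_decoder = nn.Sequential(          # stands in for a pre-trained motion prior
    nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 51 * 3))  # e.g., 51 joint targets
task_policy = nn.Sequential(             # would be trained with RL on grasp-and-follow rewards
    nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 32))

state = torch.randn(1, 128)              # humanoid + object + target-trajectory features
latent = task_policy(state)              # low-dimensional "motor skill" command
joint_targets = motion_decoder(latent)   # human-like pose targets for the simulator
print(joint_targets.shape)               # torch.Size([1, 153])
```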
Appendix A: Implementation Details
We are also committed to releasing the code. Our implementation of Stage 2 strictly follows the previous work. This section briefly introduces our tasks. Results are reported as the mean over 3 seeds (0, 1, 2), with shaded areas indicating 95% confidence intervals. The tasks require the robot hand to open a door on the table, to orient a pen to a target orientation, and to place an object from the table into a mug.
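For concreteness, a minimal sketch of how such per-seed curves are typically aggregated into a mean with a 95% confidence interval is given below; the data and exact aggregation details are assumptions, not the paper's scripts.

```python
# Minimal sketch (placeholder data): mean over seeds with a 95% confidence
# interval from the t-distribution, matching the "mean of 3 seeds, shaded 95% CI"
# convention described above.
import numpy as np
from scipy import stats

returns = np.random.rand(3, 100)                      # (num seeds, evaluation steps)
mean = returns.mean(axis=0)
sem = returns.std(axis=0, ddof=1) / np.sqrt(returns.shape[0])
half_width = stats.t.ppf(0.975, df=returns.shape[0] - 1) * sem
lower, upper = mean - half_width, mean + half_width   # the shaded region in the plots
print(mean.shape, lower.shape, upper.shape)
```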
H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation
Jiaxin Qin
Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human Hand-Informed visual representation learning framework to solve difficult Dexterous manipulation tasks (H-InDex) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) offline adapting representations with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. The last two stages modify only 0.36% of the pre-trained representation's parameters in total, ensuring that the knowledge from pre-training is preserved to the fullest extent. We empirically study 12 challenging dexterous manipulation tasks and find that H-InDex largely surpasses strong baseline methods and recent visual foundation models for motor control. Code is available at yanjieze.com/H-InDex.
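As a rough illustration of adapting only a tiny fraction of a pre-trained encoder's parameters, the sketch below freezes everything except the BatchNorm layers of a placeholder network; it is not H-InDex's actual encoder or adaptation procedure.

```python
# Minimal sketch (placeholder encoder, not H-InDex's): freeze a pre-trained
# visual encoder and leave only its BatchNorm parameters trainable, so
# downstream adaptation touches a tiny fraction of the weights.
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))

for module in encoder.modules():
    trainable = isinstance(module, nn.BatchNorm2d)
    for p in module.parameters(recurse=False):
        p.requires_grad = trainable          # only BatchNorm affine params stay trainable

n_trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in encoder.parameters())
print(f"trainable fraction: {n_trainable / n_total:.2%}")
```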
AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World
Zhiyuan Zhou, Pranav Atreya, You Liang Tan, Karl Pertsch, Sergey Levine
Scalable and reproducible policy evaluation has been a long-standing challenge in robot learning. Evaluations are critical to assess progress and build better policies, but evaluation in the real world, especially at a scale that would provide statistically reliable results, is costly in human time and hard to obtain. Evaluation of increasingly generalist robot policies requires an increasingly diverse repertoire of evaluation environments, making the evaluation bottleneck even more pronounced. To make real-world evaluation of robotic policies more practical, we propose AutoEval, a system that autonomously evaluates generalist robot policies around the clock with minimal human intervention. Users interact with AutoEval by submitting evaluation jobs to the AutoEval queue, much like how software jobs are submitted to a cluster scheduling system, and AutoEval schedules the policies for evaluation within a framework supplying automatic success detection and automatic scene resets. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process, permitting around-the-clock evaluations, and that its evaluation results correspond closely to ground-truth evaluations conducted by hand. To facilitate the evaluation of generalist policies in the robotics community, we provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms. In the future, we hope that AutoEval scenes can be set up across institutions to form a diverse and distributed evaluation network.
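A minimal sketch of this queue-and-worker pattern follows; all function names are hypothetical stand-ins rather than the AutoEval API.

```python
# Minimal sketch (hypothetical stand-ins, not the AutoEval interface): users
# enqueue evaluation jobs and a worker runs them with automatic success
# detection and scene resets, reporting a success rate per policy.
import queue
import random

job_queue = queue.Queue()

def run_episode(policy_id):           # placeholder for rolling out a policy on the robot
    return random.random() < 0.5      # placeholder automatic success detector

def reset_scene():                    # placeholder scripted or learned scene reset
    pass

def worker(num_episodes=10):
    while not job_queue.empty():
        policy_id = job_queue.get()
        successes = 0
        for _ in range(num_episodes):
            successes += run_episode(policy_id)
            reset_scene()             # restore the scene without human intervention
        print(f"{policy_id}: success rate {successes / num_episodes:.0%}")

job_queue.put("policy_a")             # "submit" jobs, like a cluster scheduler
job_queue.put("policy_b")
worker()
```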
Grasping a Handful: Sequential Multi-Object Dexterous Grasp Generation
Haofei Lu, Yifei Dong, Zehang Weng, Jens Lundell, Danica Kragic
We introduce the sequential multi-object robotic grasp sampling algorithm SeqGrasp, which can robustly synthesize stable grasps on diverse objects using the robotic hand's partial degrees of freedom (DoF). We use SeqGrasp to construct the large-scale Allegro Hand sequential grasping dataset SeqDataset and use it for training the diffusion-based sequential grasp generator SeqDiffuser. We experimentally evaluate SeqGrasp and SeqDiffuser against the state-of-the-art non-sequential multi-object grasp generation method Multi-Grasp in simulation and on a real robot. Furthermore, SeqDiffuser is approximately 1000 times faster at generating grasps than SeqGrasp and Multi-Grasp.

Generation of dexterous grasps has been studied for a long time, both from a technical perspective on generating grasps on robots [1]-[11] and from the perspective of understanding human grasping [12]-[15]. Most of these methods rely on bringing the robotic hand close to the object and then simultaneously enveloping it with all fingers. While this strategy often results in efficient and successful grasp generation, it simplifies dexterous grasping to resemble parallel-jaw grasping, thereby underutilizing the many DoF of multi-fingered robotic hands [10]. In contrast, grasping multiple objects with a robotic hand, particularly in a sequential manner that mirrors human-like dexterity, as shown in Figure 1, is still an unsolved problem. In this work, we introduce SeqGrasp, a novel hand-agnostic algorithm for generating sequential multi-object grasps.
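The sketch below conveys only the sequential-allocation intuition (fingers already holding an object stay locked while the remaining free DoF grasp the next one); it is not the SeqGrasp sampler or the SeqDiffuser model, and the finger grouping is a made-up placeholder.

```python
# Minimal sketch of the sequential idea only (not SeqGrasp or SeqDiffuser):
# fingers already holding an object stay fixed, and each new object is grasped
# with a subset of the hand's remaining free degrees of freedom.
FINGERS = ["index", "middle", "ring", "thumb"]  # e.g., an Allegro-style hand

def grasp_sequentially(objects, fingers_needed=2):
    free, assignment = list(FINGERS), {}
    for obj in objects:
        if len(free) < fingers_needed:
            break                                # hand is full; stop the sequence
        assignment[obj] = free[:fingers_needed]  # a real sampler would optimize this choice
        free = free[fingers_needed:]             # these joints are now locked around obj
    return assignment

print(grasp_sequentially(["sphere", "cylinder"]))
# {'sphere': ['index', 'middle'], 'cylinder': ['ring', 'thumb']}
```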